class: center, middle, inverse, title-slide

# Support Vector Machines
## 🚧 🛣️ 🦿
### Applied Machine Learning in R
Pittsburgh Summer Methodology Series
### Lecture 4-A July 22, 2021

---
class: inverse, center, middle

# Overview

<style type="text/css">
.onecol {
    font-size: 26px;
}
.twocol {
    font-size: 24px;
}
.remark-code {
    font-size: 24px;
    border: 1px solid grey;
}
a {
    background-color: lightblue;
}
.remark-inline-code {
    background-color: white;
}
</style>

---
class: onecol
## Maximal Margin Classifier

How do we predict the class of a new data point?

<img src="data:image/png;base64,#maxmargin1.png" width="100%" />

---
class: onecol
## Maximal Margin Classifier

With one feature, we need to find a point that separates the classes

<img src="data:image/png;base64,#maxmargin2.png" width="100%" />

---
class: onecol
## Maximal Margin Classifier

But there are many possible decision points, so which should we use?

<img src="data:image/png;base64,#maxmargin3.png" width="100%" />

---
class: onecol
## Maximal Margin Classifier

One option is to find the point with the **maximal margin** between the classes

<img src="data:image/png;base64,#maxmargin4.png" width="100%" />

---
class: onecol
## Maximal Margin Classifier

With two features, we need a 2D plot and a decision **line**

<img src="data:image/png;base64,#maxmargin5.png" width="80%" />

---
class: onecol
## Maximal Margin Classifier

With three features, we need a 3D plot and a decision **plane**

<img src="data:image/png;base64,#3d_plane.gif" width="50%" />

---
class: onecol
## Maximal Margin Classifier

With four or more features, we can't plot it but we need a decision **hyperplane**

--

.bg-light-yellow.b--light-red.ba.bw1.br3.pl4[
**Caution:** You may hurt yourself if you try to imagine what a hyperplane looks like.
]

--

.pt1[
**Margins still exist** in higher-dimensional space and we still want to maximize them
]

- Our goal is thus to locate the class-separating hyperplane with the largest margin

- The math behind this is beyond the scope of our workshop, but that's the idea

--

The hyperplane allows us to **classify new observations** (which side of it do they fall on?)

- Since the dimensions of this space are determined by the features, we are exploring "feature space" and looking for regions occupied by each class

---
class: onecol
## Maximal Margin Classifier

Only the observations that define the margin, called **support vectors**, are used

Thus, the maximal margin classifier (MMC) and related methods focus on the most ambiguous/difficult examples

<img src="data:image/png;base64,#maxmargin8.png" width="80%" />

---
class: onecol
## Maximal Margin Classifier

This means that **outliers can have an outsized impact** on what is learned

For instance, this margin is likely to misclassify examples in new data

<img src="data:image/png;base64,#maxmargin9.png" width="80%" />

---
class: onecol
## Support Vector Classifier

The **support vector classifier** (SVC) innovates by allowing the outlier to be misclassified

This will **increase bias** (training errors) but hopefully **decrease variance** (testing errors)

<img src="data:image/png;base64,#svc1.png" width="80%" />

---
class: onecol
## Support Vector Classifier

SVCs also work better than MMCs when the classes are **not perfectly separable**

An MMC's hyperplane is never going to separate these classes without errors

<img src="data:image/png;base64,#svc2.png" width="80%" />

---
class: onecol
## Support Vector Classifier

But if we allow a few errors and points within the margin...
...we may be able to find a hyperplane that generalizes pretty well

<img src="data:image/png;base64,#svc3.png" width="80%" />

---
class: onecol
## Support Vector Classifier

When points are on the wrong side of the margin, they are called "violations"

A **soft margin** allows violations, whereas a **hard margin** does not (as in MMC)

SVCs have a hyperparameter `\(C\)` that controls margin "softness" on a continuum

--

<p style="padding-top:10px;">This hyperparameter controls the number and magnitude of violations allowed</p>

- A **lower `\(C\)` value** makes the margin harder (allows fewer violations)
  As a result, the model has **lower bias** but is more likely to overfit (higher variance)

- A **higher `\(C\)` value** makes the margin softer (allows more violations)
  As a result, the model has more bias but is less likely to overfit (**lower variance**)

---
class: onecol
## Support Vector Machine

So far, MMCs and SVCs have both used linear (i.e., flat) hyperplanes

But there are many times when the classes are not **linearly separable**

<img src="data:image/png;base64,#svm1.png" width="80%" />

.footnote[[1] Good luck separating these classes with a single decision point...]

---
class: onecol
## Support Vector Machine

But if we enlarge the feature space, the classes might then become linearly separable

There are many ways to enlarge it, but one is to add polynomial expansions

<img src="data:image/png;base64,#svm2.png" width="80%" />

---
class: onecol
## Support Vector Machine

The classes are now linearly separable in this new enlarged feature space!
<img src="data:image/png;base64,#svm3.png" width="80%" />

---
class: onecol
## Support Vector Machine

The **support vector machine** (SVM) allows us to efficiently enlarge the feature space

- Part of what makes SVMs efficient is that they **only consider the support vectors**

- They also use **kernel functions** to quantify the similarity of pairs of support vectors<sup>1</sup>

<p style="padding-top:15px;">The SVC can actually be considered a simple version of the SVM with a <b>linear kernel</b></p>

- A linear kernel quantifies similarity using the inner product of the two observations (closely related to their Pearson correlation)

`$$K(x_i, x_{i'}) = \sum_{j=1}^p x_{ij}x_{i'j}$$`

.footnote[[1] These similarity estimates are used to efficiently find the optimal hyperplane but that process is complex.]

---
class: onecol
## Support Vector Machine

It is common to also use **nonlinear** kernels, such as the **polynomial kernel**

`$$K(x_i, x_{i'}) = \left(1 + \sum_{j=1}^p x_{ij} x_{i'j}\right)^d$$`

With larger values of `\(d\)`, the decision boundary can become more flexible/complex

- You are essentially adding polynomial expansions of degree `\(d\)` to each predictor

- You have expanded the feature space and may now have linear separation

- This is the same idea we just used in fitting a hyperplane in the `\(x\text{-by-}x^2\)` space!

.footnote[[1] When `\(d=1\)`, the polynomial kernel reduces to the linear kernel and SVM becomes SVC again.]

---
class: onecol
## Support Vector Machine

Perhaps the most common kernel is the **radial basis function** (RBF) or radial kernel

`$$K(x_i,x_{i'})=\exp\left(-\gamma\sum_{j=1}^p(x_{ij}-x_{i'j})^2\right)$$`

--

The intuition here is that similarity is weighted by how close the observations are

- Only support vectors near new observations influence classification strongly

The RBF kernel actually computes similarity between points in *infinite* dimensions<sup>1</sup>

As the `\(\gamma\)` hyperparameter increases, the fit becomes more nonlinear and complex

.footnote[[1] This is part of why RBF is so popular: who needs more than infinite dimensions?!]
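
---
class: onecol
## Support Vector Machine

To make the kernel formulas concrete, here is a minimal base R sketch of the linear, polynomial, and RBF kernels. The function names, the two example observations, and the values chosen for `d` and `gamma` are made up for illustration.

```r
# Two hypothetical observations with p = 3 features (made-up values)
x1 <- c(1, 2, 3)
x2 <- c(2, 0, 1)

# Linear kernel: sum over features of the products x_ij * x_i'j
linear_kernel <- function(a, b) {
  sum(a * b)
}

# Polynomial kernel of degree d: (1 + inner product)^d
poly_kernel <- function(a, b, d = 2) {
  (1 + sum(a * b))^d
}

# RBF kernel: exp(-gamma * squared Euclidean distance)
rbf_kernel <- function(a, b, gamma = 0.1) {
  exp(-gamma * sum((a - b)^2))
}

linear_kernel(x1, x2)            # 1*2 + 2*0 + 3*1 = 5
poly_kernel(x1, x2, d = 2)       # (1 + 5)^2 = 36
rbf_kernel(x1, x2, gamma = 0.1)  # exp(-0.1 * 9), about 0.407
rbf_kernel(x1, x2, gamma = 1)    # larger gamma: similarity decays faster with distance
```

Each function returns a single similarity score for one pair of observations. With the RBF kernel, raising `gamma` shrinks the score for distant pairs, which is why only nearby support vectors influence classification strongly.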

---
## Support vector regression

---
## Applied Example

---
## Live Coding Activity

---
## Hands-on Activity

---
## Break and timer